Softmax regression, also called multinomial logistic regression, is a generalization of logistic regression to the case where we want to handle multiple classes.
Specification
Hypothesis:
(Here $x \in \mathbb{R}^d$ is a single example and $\theta \in \mathbb{R}^{d \times K}$ collects the class weight vectors $\theta^{(1)}, \dots, \theta^{(K)}$ as columns; the design matrix of all $m$ examples is $m \times d$.)
$$
h_\theta(x) =
\begin{bmatrix}
P(y=1 \mid x; \theta) \\
P(y=2 \mid x; \theta) \\
\vdots \\
P(y=K \mid x; \theta)
\end{bmatrix}
=
\frac{1}{\sum_{j=1}^{K} \exp\!\left(\theta^{(j)\top} x\right)}
\begin{bmatrix}
\exp\!\left(\theta^{(1)\top} x\right) \\
\exp\!\left(\theta^{(2)\top} x\right) \\
\vdots \\
\exp\!\left(\theta^{(K)\top} x\right)
\end{bmatrix}
$$
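As a minimal sketch (not from the tutorial), the hypothesis can be computed with NumPy; `theta` is assumed to be a $d \times K$ matrix whose $k$-th column is $\theta^{(k)}$, and the maximum score is subtracted before exponentiating for numerical stability:

```python
import numpy as np

def softmax_hypothesis(x, theta):
    """Return the K class probabilities h_theta(x) for a single example x of shape (d,).

    theta is assumed to be a (d, K) matrix whose k-th column is theta^(k).
    """
    scores = theta.T @ x                  # theta^(k)^T x for each class k -> shape (K,)
    scores -= scores.max()                # shift for numerical stability (does not change the result)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize so the probabilities sum to 1
```

For example, with $d = 3$ features and $K = 4$ classes, `softmax_hypothesis(np.ones(3), np.zeros((3, 4)))` returns the uniform vector `[0.25, 0.25, 0.25, 0.25]`.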
Cost function (cross entropy):
$$
J(\theta) = -\left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\!\left\{y^{(i)} = k\right\} \log \frac{\exp\!\left(\theta^{(k)\top} x^{(i)}\right)}{\sum_{j=1}^{K} \exp\!\left(\theta^{(j)\top} x^{(i)}\right)} \right]
$$
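A hedged sketch of this cost, assuming `X` is the $m \times d$ design matrix and `y` holds integer labels in $\{0, \dots, K-1\}$ (zero-based, unlike the 1-based indexing above):

```python
import numpy as np

def softmax_cost(theta, X, y):
    """Cross-entropy cost J(theta) for X of shape (m, d), labels y of shape (m,), theta of shape (d, K)."""
    scores = X @ theta                                    # (m, K) matrix of theta^(k)^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)           # stabilize the exponentials
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))  # log P(y=k | x^(i))
    m = X.shape[0]
    return -log_probs[np.arange(m), y].sum()              # -sum_i log P(y^(i) | x^(i))
```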
Optimize with gradient descent; the gradient with respect to θ^(k) is:
$$
\nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\!\left\{y^{(i)} = k\right\} - P\!\left(y^{(i)} = k \mid x^{(i)}; \theta\right) \right) \right]
$$
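One possible vectorized implementation of this gradient, under the same assumptions as the cost sketch above (an illustrative sketch, not the tutorial's code; the learning rate `alpha` is a hypothetical choice):

```python
import numpy as np

def softmax_gradient(theta, X, y):
    """Gradient of J with respect to theta, returned as a (d, K) matrix."""
    scores = X @ theta
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)      # P(y = k | x^(i); theta), shape (m, K)
    indicator = np.zeros_like(probs)
    indicator[np.arange(X.shape[0]), y] = 1.0      # 1{y^(i) = k}
    return -X.T @ (indicator - probs)              # k-th column is nabla_{theta^(k)} J

# One plain gradient-descent update (alpha is a hypothetical step size):
# theta -= alpha * softmax_gradient(theta, X, y)
```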
Redundancy of Parameters
The actual number of parameters is only (K−1)·d rather than K·d, because the probabilities always sum to 1. Subtracting a fixed vector ψ from every θ^(j) does not affect the hypothesis' predictions at all:
$$
P\!\left(y^{(i)} = k \mid x^{(i)}; \theta\right)
= \frac{\exp\!\left((\theta^{(k)} - \psi)^\top x^{(i)}\right)}{\sum_{j=1}^{K} \exp\!\left((\theta^{(j)} - \psi)^\top x^{(i)}\right)}
= \frac{\exp\!\left(\theta^{(k)\top} x^{(i)}\right) \exp\!\left(-\psi^\top x^{(i)}\right)}{\sum_{j=1}^{K} \exp\!\left(\theta^{(j)\top} x^{(i)}\right) \exp\!\left(-\psi^\top x^{(i)}\right)}
= \frac{\exp\!\left(\theta^{(k)\top} x^{(i)}\right)}{\sum_{j=1}^{K} \exp\!\left(\theta^{(j)\top} x^{(i)}\right)}.
$$
So one can instead set $\theta^{(K)} = \vec{0}$ and optimize only with respect to the remaining parameters.
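A quick numerical check of this redundancy (an illustrative sketch, with the softmax computed inline as assumed above): choosing $\psi = \theta^{(K)}$ and subtracting it from every column zeroes the last column yet leaves the predicted probabilities unchanged.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()   # stabilize; shifting does not change the probabilities
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(0)
d, K = 3, 4
x = rng.normal(size=d)
theta = rng.normal(size=(d, K))

psi = theta[:, -1].copy()            # pick psi = theta^(K), so the last column becomes 0
theta_shifted = theta - psi[:, None] # subtract psi from every theta^(j)

print(np.allclose(softmax(theta.T @ x), softmax(theta_shifted.T @ x)))  # True: same predictions
```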
Note that J(θ) is still convex, but the Hessian is singular, which causes a straightforward implementation of Newton's method to run into numerical problems.
Reference
http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/